Word/figure count

Words: [The number of words in your document, calculated using the word_count() function at the end of this document] Figures: [The number of figures in your document. You can just count these]
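The word_count() helper mentioned above is not defined in this chunk. As a minimal sketch (an assumption on my part — the real helper may well be implemented differently), a function that counts whitespace-separated tokens in the prose of the .Rmd file could look like:

```r
# Hypothetical helper: counts whitespace-separated tokens in the .Rmd text,
# skipping fenced code chunks so only prose is counted.
word_count <- function(rmd_path) {
  lines <- readLines(rmd_path, warn = FALSE)
  
  # Lines inside ``` fences are code, not prose
  in_chunk <- cumsum(grepl("^```", lines)) %% 2 == 1
  prose <- lines[!in_chunk & !grepl("^```", lines)]
  
  # Split on whitespace and count the non-empty tokens
  sum(lengths(regmatches(prose, gregexpr("\\S+", prose))))
}
```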

Locations on GitHub

  • The RMarkdown file that produced this HTML document can be found here: data_cleaning.Rmd.

Data Cleaning

The purpose of this .Rmd file is to search for, and keep a log of, any dodgy data points.

Package Loading

If you are unfamiliar with some of these packages, I’ve left short code comments and links where you can find more information.

suppressPackageStartupMessages({
  ### Tidyverse Collection -----------------------------------------------------
  # The tidyverse is an opinionated collection of R packages designed for
  # data science. All packages share an underlying design philosophy, grammar,
  # and data structures.
  ### https://www.tidyverse.org/
  library(tidyverse)
  
  
  
  ### Data Exploration ---------------------------------------------------------
  # To easily display summary statistics
  ### https://github.com/ropensci/skimr
  library(skimr)
  
  
  
  ### Data Cleaning/Wrangling --------------------------------------------------
  # To easily examine and clean dirty data
  ### https://www.rdocumentation.org/packages/janitor/versions/2.2.0
  library(janitor)
  
  # To easily use date-time data
  ### https://lubridate.tidyverse.org/
  library(lubridate)
  
  # To easily handle categorical variables using factors.
  ### https://forcats.tidyverse.org/
  library(forcats)
  
  
  
  ### Data Visualisation -------------------------------------------------------
  # To easily use already-made themes for data visualisation
  ### https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
  library(ggthemes)
  
  # To combine a scatter plot and a 2D density plot
  ### https://github.com/LKremer/ggpointdensity
  library(ggpointdensity)
  
  
  
  ### Spatial Visualisation ----------------------------------------------------
  # To easily manipulate spatial data
  ### https://r-spatial.github.io/sf/
  library(sf)
  
  # This may take a little while to install if you don't have absmapsdata already.
  # Whilst I'm only using one shapefile from the package, I thought acquiring
  # the data through a package would be more standardised/reproducible than
  # downloading it straight from the ABS.
  options(timeout = 1000)
  devtools::install_github("wfmackey/absmapsdata")
  # To easily access the Australian Bureau of Statistics (ABS) spatial structures
  ### https://github.com/wfmackey/absmapsdata
  library(absmapsdata)
  
  # To easily use spatial data with ggplot2
  ### https://paleolimbot.github.io/ggspatial/
  library(ggspatial)
  
  
  
  ### Misc ---------------------------------------------------------------------
  # To easily standardise naming conventions based upon a consistent design
  # philosophy
  ### https://github.com/Tazinho/snakecase
  library(snakecase)
  
  # To locate and download species observations from the Atlas of Living
  # Australia
  ### https://galah.ala.org.au/
  library(galah)
  
  # To easily enable file referencing in project-oriented workflows
  ### https://here.r-lib.org/
  library(here)
})
## WARNING: Rtools is required to build R packages, but is not currently installed.
## 
## Please download and install Rtools 4.2 from https://cran.r-project.org/bin/windows/Rtools/.
## Skipping install of 'absmapsdata' from a github remote, the SHA1 (513415b9) has not changed since last install.
##   Use `force = TRUE` to force installation

Data Loading and Sanity Checks

Packages are all loaded in! Let’s read in a specific invasive animal species for a particular state and do some sanity checks. Ideally, we will functionalise this code so that these sanity checks can run on all the invasive animal species we end up choosing for each state. Arguably, this data cleaning could be a package/RShiny dashboard in and of itself!

The following code is mainly reused from, and inspired by, the ALA Labs post An exploration of dingo observations in the ALA.

First, let’s log into the Atlas of Living Australia using the galah_config() function.

# Ref [1]
# Use an Atlas of Living Australia-registered email (register at ala.org.au)
galah_config(email = "johann.wagner@gmail.com")

# References:
# - [1] https://labs.ala.org.au/posts/2023-05-16_dingoes/post.html

Rabbits in the ACT

Let’s focus on exploring rabbits in the ACT. We will scale to other species and other states/territories later.

# Ref [1, 4]
# To download data from the ALA
rabbits <- galah_call() %>%
  
  # To conduct a search for the scientific name of the European Rabbit
  galah_identify("oryctolagus cuniculus") %>% # Ref [2]
  
  # Filter records observed in the ACT
  galah_filter(cl22 == "Australian Capital Territory") %>% 
  
  # Pre-applied filter to ensure quality-assured data
  # the "ALA" profile is designed to exclude lower quality records Ref [3]
  galah_apply_profile(ALA) %>% 
  atlas_occurrences()
## This query will return 1,855 records
## 
## Checking queue
## Current queue size: 1 inqueue .
rabbits
# References:
# - [1] https://labs.ala.org.au/posts/2023-05-16_dingoes/post.html
# - [2] https://bie.ala.org.au/species/https://biodiversity.org.au/afd/taxa/692effa3-b719-495f-a86f-ce89e2981652
# - [3] https://galah.ala.org.au/R/reference/galah_apply_profile.html
# - [4] https://galah.ala.org.au/R/reference/galah.html

Let’s use the skim() function to explore this dataset.

skim(rabbits)
Data summary
Name rabbits
Number of rows 1855
Number of columns 8
_______________________
Column type frequency:
character 5
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
scientificName 0 1 21 31 0 2 0
taxonConceptID 0 1 73 73 0 2 0
recordID 0 1 36 36 0 1855 0
dataResourceName 0 1 9 58 0 9 0
occurrenceStatus 0 1 7 7 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
decimalLatitude 0 1 -35.33 0.14 -35.89 -35.35 -35.28 -35.26 -35.12 ▁▁▁▇▇
decimalLongitude 0 1 149.12 0.13 148.79 149.11 149.14 149.16 150.77 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
eventDate 0 1 1964-09-02 2023-09-28 04:34:38 2017-01-01 1088

Interestingly, there are two unique scientificName values, which is unexpected. Similarly, there are two unique values for taxonConceptID. There are 1,855 occurrences of rabbits in the ACT from 9 different data providers (dataResourceName). The most recent observation is 2023-09-28 04:34:38 and the earliest is 1964-09-02. All the observations have longitude and latitude values; we will plot these in a moment to make sure they are all in the ACT.

It would be good to create a function that checks whether each observation is within the spatial boundaries of its respective state/territory.
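As a sketch of that idea (the function name is illustrative, and I’m assuming the same state2021 shapefile from absmapsdata used later in this document), such a check could be written with sf:

```r
# Illustrative helper: returns TRUE for each observation that falls inside
# the polygon of the given state/territory. Assumes decimalLongitude /
# decimalLatitude columns and the state2021 shapefile from absmapsdata.
is_in_state <- function(occurrences, state_name) {
  points_sf <- st_as_sf(
    occurrences,
    coords = c("decimalLongitude", "decimalLatitude"),
    crs = st_crs(state2021)
  )
  
  boundary <- state2021 %>% filter(state_name_2021 == state_name)
  
  # lengths() > 0 means the point intersects at least one polygon
  lengths(st_intersects(points_sf, boundary)) > 0
}

# e.g. mean(is_in_state(rabbits, "Australian Capital Territory"))
```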

Potential dodgy rabbit data

Let’s explore what the two unique scientificName values are.

rabbits %>% 
  group_by(scientificName) %>% 
  count()

Oh wow! This is our first data cleaning pick-up! Even with the pre-applied ALA profile, we have found a suspicious value in scientificName. Let’s find that specific data point.

rabbits %>% 
  filter(scientificName == "Oryctolagus cuniculus cuniculus")

Create suspicious data

Looking at the taxonConceptID values/websites, “Oryctolagus cuniculus cuniculus” is actually a subspecies, not a typo! Now that we know it is a subspecies, let’s note this value down in our suspicious_data dataset to reassess later, but keep it in our rabbits dataset.

# Let's reuse the same columns/column names as in rabbits
rabbits_suspicious <- rabbits %>% 
  # Keep rabbits' columns but none of its rows
  slice(0) %>% 
  mutate(
    suspicious_notes = character(),
    date             = character()
    ) %>% 
  
  # Add the suspicious data
  bind_rows(
    rabbits %>% 
      filter(scientificName == "Oryctolagus cuniculus cuniculus") %>% 
      mutate(
        suspicious_notes = "This is a subspecies of the Oryctolagus cuniculus species and is the only one in the ACT.",
        date = "2023-10-05"
        )
  )

rabbits_suspicious

This suspicious data point is from the iNaturalist Australia data provider, a citizen science project, so these observations may warrant more cautious investigation. Let’s continue on to the spatial visualisation.

Spatial visualisation of rabbits in the ACT

Let’s visualise the rabbits data spatially by creating a map with data points of the observations.

# Ref [1]: Create spatial visualisation of the ACT
state2021 %>% 
  filter(state_name_2021 == "Australian Capital Territory") %>% 
  ggplot() +
  
  # Ref [1]: Create background polygon of the ACT
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = rabbits,
    aes(
      x = decimalLongitude,
      y = decimalLatitude
    ),
    alpha = 0.6
  ) +
  
  coord_sf() +
  
  theme_minimal()

# References:
# - [1] https://github.com/wfmackey/absmapsdata/tree/master

Interestingly, it seems there are a select number of observations outside the ACT borders, which I assume are in the Jervis Bay Territory. Let’s spatially isolate these data points and see if they really are in the Jervis Bay Territory.

# Ref [1]: Create spatial visualisation of the ACT
state2021 %>% 
  filter(state_name_2021 == "Australian Capital Territory") %>% 
  ggplot() +
  
  # Ref [1]: Create background polygon of the ACT
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  # Plot the Jervis Bay polygon
  geom_sf(
    data = {
      suburb2021 %>%
        filter(suburb_name_2021 == "Jervis Bay")
      },
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = rabbits,
    aes(
      x = decimalLongitude,
      y = decimalLatitude
    ),
    alpha = 0.6
  ) +
  
  coord_sf() +
  
  theme_minimal()

# References:
# - [1] https://github.com/wfmackey/absmapsdata/tree/master

Hypothesis confirmed! The data points are from the Jervis Bay Territory.

Exclusion of Jervis Bay territory data

For the scope of this assignment, we will put these data points into the excluded_data dataset and remove them from our main dataset, because I only want to showcase observations within the main state/territory borders; this RShiny dashboard tool is intended for the macro scale, not small territories.

Let’s create an sf object so that we can use the st_filter() function, which keeps only the spatial points that fall within a given spatial polygon. In our case, we only want to include the data points that are within the ACT boundary polygon.

rabbits_sf_clean <- rabbits %>% 
  
  # Ref [1]: Convert rabbits tibble into an sf data type
  st_as_sf(
    coords = c("decimalLongitude", "decimalLatitude"),
    crs = "+proj=longlat +datum=WGS84"
  ) %>% 
  
  # Ref [2]: Filter the data points that are within the ACT
  st_filter(
    state2021 %>% 
      filter(state_name_2021 == "Australian Capital Territory")
  )

rabbits_sf_clean
# References:
# - [1] https://stackoverflow.com/a/52951856/22410914
# - [2] https://rdrr.io/cran/sf/man/st_as_sf.html

We’ve now filtered out the excluded data points; however, we want to keep using the tibble data type, so let’s do some joins.

rabbits_clean <- rabbits %>% 
  
  # Find all the clean observations
  right_join(
    rabbits_sf_clean,
    join_by(
      eventDate,
      scientificName,
      taxonConceptID,
      recordID,
      dataResourceName,
      occurrenceStatus
      )
    )

rabbits_clean

This is now the clean rabbits dataset for the ACT.

Let’s look at the excluded datasets.

rabbits_excluded <- rabbits %>% 
  
  # Find all the excluded observations
  anti_join(
    rabbits_sf_clean,
    join_by(
      eventDate,
      scientificName,
      taxonConceptID,
      recordID,
      dataResourceName,
      occurrenceStatus
      )
    )

rabbits_excluded

Let’s plot these points just for sanity checks.

# Ref [1]: Create spatial visualisation of the ACT
state2021 %>% 
  filter(state_name_2021 == "Australian Capital Territory") %>% 
  ggplot() +
  
  # Ref [1]: Create background polygon of the ACT
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  # Plot the Jervis Bay polygon
  geom_sf(
    data = {
      suburb2021 %>%
        filter(suburb_name_2021 == "Jervis Bay")
      },
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = rabbits_excluded,
    aes(
      x = decimalLongitude,
      y = decimalLatitude
    ),
    alpha = 0.6
  ) +
  
  coord_sf() +
  
  theme_minimal() +
  
  labs(title = "Excluded Data Points")

# References:
# - [1] https://github.com/wfmackey/absmapsdata/tree/master

It seems we have picked up a few extra points just outside the ACT border. Given the scope of the assignment, we will keep excluding all of these points. This is likely either a misalignment between the ABS boundary and the mapping software used by the data providers, a slight change in the borders over time (the ABS boundary is the 2021 one), or an error in the GPS location data. Regardless of the reason, these points will be excluded.

Create excluded data

# Let's reuse the same columns/column names as in rabbits
rabbits_excluded <- rabbits_excluded %>% 
  
  # Add excluded_notes
  mutate(
    date = "2023-10-05",
    
    excluded_notes = "These data points are outside the ABS ACT boundary. They include points that are just outside the boundary within a few kilometres and points that are in Jervis Bay."
    )

rabbits_excluded

Final Spatial Visualisation

Let’s make a final spatial visualisation.

# Ref [1]: Create spatial visualisation of the ACT
state2021 %>% 
  filter(state_name_2021 == "Australian Capital Territory") %>% 
  ggplot() +
  
  # Ref [1]: Create background polygon of the ACT
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = rabbits_clean,
    aes(
      x = decimalLongitude,
      y = decimalLatitude
    ),
    alpha = 0.6
  ) +
  
  coord_sf() +
  
  theme_minimal() +
  
  # Ref [2]
  labs(title = "Rabbit observations in the ACT")

# References:
# - [1] https://github.com/wfmackey/absmapsdata/tree/master
# - [2] https://labs.ala.org.au/posts/2023-05-16_dingoes/post.html

Scaling from a single state to all of Australia

Ideally, we want to create functions that do all of this cleaning for us.

load_galah_occurrence_data function

Let’s start with a function that extracts the raw data for a particular species for all of Australia. We will sort out the state/territory assignment later.

load_galah_occurrence_data <- function(
    # Character string of the scientific name of the species
    species_name
    ) {
  species_occurrence <- galah_call() %>%
    
    # To conduct a search for the species' scientific name
    galah_identify(species_name) %>%
    
    # Pre-applied filter to ensure quality-assured data;
    # the "ALA" profile is designed to exclude lower quality records
    galah_apply_profile(ALA) %>% 
    atlas_occurrences()
  
  return(species_occurrence)
}

Rabbits in all of Australia

Let’s test the load_galah_occurrence_data function.

rabbits_all_states <- load_galah_occurrence_data(
  species_name = "oryctolagus cuniculus"
  )
## This query will return 113,383 records
## 
## Checking queue
## Current queue size: 1 inqueue . running ......
rabbits_all_states
skim(rabbits_all_states)
Data summary
Name rabbits_all_states
Number of rows 113383
Number of columns 8
_______________________
Column type frequency:
character 5
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
scientificName 0 1 21 31 0 2 0
taxonConceptID 0 1 73 73 0 2 0
recordID 0 1 36 36 0 113383 0
dataResourceName 0 1 8 60 0 36 0
occurrenceStatus 0 1 7 7 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
decimalLatitude 151 1 -33.84 4.28 -54.58 -36.85 -34.25 -32.77 64.13 ▇▂▁▁▁
decimalLongitude 151 1 143.30 7.63 -122.80 140.43 144.28 149.09 175.70 ▁▁▁▁▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
eventDate 3097 0.97 1824-01-01 2023-10-01 01:12:00 2016-04-25 17498

It seems like there are some missing values for decimalLatitude and decimalLongitude, as well as for eventDate. Let’s explore these values.

rabbits_all_states %>% 
  filter(is.na(decimalLatitude)) %>% 
  group_by(dataResourceName) %>% 
  count() %>% 
  arrange(desc(n))
rabbits_all_states %>% 
  filter(is.na(eventDate)) %>% 
  group_by(dataResourceName) %>% 
  count() %>% 
  arrange(desc(n))

It seems that most of these missing values are from museum providers via OZCAM. It also seems that the Tasmanian Values Atlas has not provided any spatial information. Let’s put all of these data points into the excluded_data dataset.

remove_na_values and add_na_values_to_excluded_data functions

Let’s create a function that will do this automatically.

remove_na_values <- function(species_occurrence) {
  species_occurrence_clean <- species_occurrence %>% 
    filter(
      !is.na(eventDate),
      !is.na(decimalLatitude),
      !is.na(decimalLongitude)
    )
  
  return(species_occurrence_clean)
}

add_na_eventdate_to_excluded_data <- function(species_occurrence) {
  # Let's reuse the same columns/column names as in rabbits
  add_excluded <- species_occurrence %>% 
    
    filter(is.na(eventDate)) %>% 
    
    # Add excluded_notes
    mutate(
      date = Sys.Date(),
      
      excluded_notes = "These data points have a missing eventDate value."
      )
  
  excluded_data <- bind_rows(excluded_data, add_excluded)
  return(excluded_data)
}
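These helpers read and grow the global excluded_data object, which works fine in a notebook but is fragile if the functions are ever moved into a package. A more self-contained variant (illustrative only, not the version used below) would pass the log in explicitly:

```r
# Illustrative alternative: takes the current log as an argument instead of
# relying on a global excluded_data object, and returns the grown log.
add_na_eventdate_to_log <- function(species_occurrence, excluded_log) {
  add_excluded <- species_occurrence %>% 
    filter(is.na(eventDate)) %>% 
    mutate(
      date           = Sys.Date(),
      excluded_notes = "These data points have a missing eventDate value."
    )
  
  bind_rows(excluded_log, add_excluded)
}

# Usage:
# excluded_data <- add_na_eventdate_to_log(rabbits_all_states, excluded_data)
```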

add_na_spatial_to_excluded_data <- function(species_occurrence) {
  # Let's reuse the same columns/column names as in rabbits
  add_excluded <- species_occurrence %>% 
    
    # Log rows missing either coordinate, mirroring remove_na_values()
    filter(
      is.na(decimalLatitude) | is.na(decimalLongitude)
    ) %>% 
    
    # Add excluded_notes
    mutate(
      date = Sys.Date(),
      
      excluded_notes = "These data points have missing spatial values."
      )
  
  excluded_data <- bind_rows(excluded_data, add_excluded)
  return(excluded_data)
}

Now, let’s use the functions and remove and log the NA values.

rabbits_all_states_clean <- rabbits_all_states %>% 
  remove_na_values()
  
# Initialise empty excluded_data tibble.
excluded_data <- tibble()

excluded_data <- rabbits_all_states %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- rabbits_all_states %>% 
  add_na_spatial_to_excluded_data()

rabbits_all_states_clean
excluded_data

Figuring out which point is in which state

Now, let’s convert rabbits_all_states_clean into an sf object so we can find out which point lies within which state/territory. Then we can use the st_filter() function to keep the points that fall within each respective state and create a new column called state recording which state each point is in.

# Ref [1,2]
add_state_column <- function(species_all_states_clean_input) {
  
  # Make sure both your tibble and shapefile have the same coordinate reference
  # system (CRS)
  species_all_states_clean_sf <- st_as_sf(
    species_all_states_clean_input,
    coords = c("decimalLongitude", "decimalLatitude"),
    crs = st_crs(state2021)
  )
  
  # Create a vector of state/territory names
  state_names <- c(
    "New South Wales",
    "Victoria",
    "Queensland",
    "South Australia",
    "Western Australia",
    "Tasmania",
    "Northern Territory",
    "Australian Capital Territory"
  )
  
  # Initialise empty list
  species_all_states_clean_list <- list()
  
  # Ref [3]: Create a separate tibble for all points within each state and
  # put in list
  for (state_name in state_names) {
    species_state_clean <- species_all_states_clean_sf %>% 
      st_filter(state2021 %>% filter(state_name_2021 == state_name)) %>% 
      mutate(state = state_name) %>% 
      as_tibble()
    
    species_all_states_clean_list[[state_name]] <- species_state_clean
  }
  
  # Bind all 8 tibbles back into one big tibble and join the new columns
  # (geometry, state) onto the input tibble
  species_all_states_clean_output <- species_all_states_clean_input %>% 
    left_join(
      bind_rows(species_all_states_clean_list),
      join_by(
        eventDate,
        scientificName,
        taxonConceptID,
        recordID,
        dataResourceName,
        occurrenceStatus
      )
    )
  
  return(species_all_states_clean_output)
}



rabbits_all_states_clean <- rabbits_all_states %>% 
  remove_na_values() %>% 
  add_state_column()

rabbits_all_states_clean
# References:
# - [1] https://rdrr.io/cran/sf/man/st_as_sf.html
# - [2] https://chat.openai.com/share/3e22a588-45d1-4959-923b-ac85da5e34b0
# - [3] https://chat.openai.com/share/81f6f18f-e6e6-4c41-8750-c78e45e0c16e

Sanity checking the spatial data

Great! Now we have a cleaned tibble, where we have removed rows with missing spatial data or missing eventDate values, and added a state column indicating which point is in which state. Let’s quickly do some sanity checks on the state column.

rabbits_all_states_clean %>% 
  group_by(state) %>% 
  count() %>% 
  arrange(desc(n))

It seems there are several points that do not fall within any of the state/territory polygons. Let’s investigate them, as we will likely just exclude these points.

state2021 %>% 
  ggplot() +
  
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = {
      rabbits_all_states_clean %>% 
        mutate(
          state = case_when(
            is.na(state) ~ "Not within any state/territory",
            .default = state
          )
        )
      },
    aes(
      x = decimalLongitude,
      y = decimalLatitude,
      colour = state
    ),
    alpha = 0.6
  ) +
  
  scale_colour_viridis_d() +
  
  coord_sf() +
  
  theme_minimal() +
  
  labs(title = "Rabbit observations in Australia")

Ah hah! Interestingly, the dataset I am using must not be filtered to observations in Australia only. Good thing we did this sanity check! Let’s show only the points not within any state/territory, just to make sure.
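To quantify how many of these points are plainly outside Australia, a rough bounding-box check can help; the coordinates below are approximate values for mainland Australia that I’ve chosen for illustration, not an authoritative extent:

```r
# Approximate bounding box for Australia (illustrative values only)
outside_australia <- rabbits_all_states_clean %>% 
  filter(
    decimalLongitude < 112 | decimalLongitude > 154 |
      decimalLatitude < -44 | decimalLatitude > -9
  )

# How many observations fall well outside the country?
nrow(outside_australia)
```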

state2021 %>% 
  ggplot() +
  
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = {
      rabbits_all_states_clean %>% 
        mutate(
          state = case_when(
            is.na(state) ~ "Not within any state/territory",
            .default = state
          )
        ) %>% 
        filter(state == "Not within any state/territory")
      },
    aes(
      x = decimalLongitude,
      y = decimalLatitude,
      colour = state
    ),
    alpha = 0.6
  ) +
  
  scale_colour_viridis_d() +
  
  coord_sf() +
  
  theme_minimal() +
  
  labs(title = "Rabbit observations in Australia",
       subtitle = "Only observations that didn't fall within any state/territory")

Ocean Rabbits? Let’s exclude them!

Interestingly, it seems that there are quite a few observations along the coastline / in the water. Let’s look at just the southern coastline of Victoria.

state2021 %>% 
  ggplot() +
  
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = {
      rabbits_all_states_clean %>% 
        mutate(
          state = case_when(
            is.na(state) ~ "Not within any state/territory",
            .default = state
          )
        ) %>% 
        filter(state == "Not within any state/territory")
      },
    aes(
      x = decimalLongitude,
      y = decimalLatitude,
      colour = state
    ),
    alpha = 0.6
  ) +
  
  scale_colour_viridis_d() +
  
  coord_sf(
    xlim = c(142.0, 150.0),
    ylim = c(-40, -35)
  ) +
  
  theme_minimal() +
  
  labs(title = "Rabbit observations in southern Victoria",
       subtitle = "Points that are not within any state/territory")

I have never heard of any ocean rabbits, so let’s take a blanket assumption and exclude these results. There is probably some uncertainty in the GPS measurements, some data entry errors, islands that don’t appear on the ABS maps, or even a misalignment between the ABS data and the ALA spatial data. Regardless of the reason, let’s put all of these measurements into the excluded_data dataset.

add_not_within_abs_map_to_excluded_data

Let’s create a function that adds these “ocean rabbits” to the excluded_data dataset.

remove_not_within_abs_map_values <- function(species_all_states_clean) {
  species_all_states_clean <- species_all_states_clean %>% 
    filter(
      !is.na(state)
    )
  
  return(species_all_states_clean)
}

add_not_within_abs_map_to_excluded_data <- function(species_all_states_clean) {
  # Let's reuse the same columns/column names as in rabbits
  add_excluded <- species_all_states_clean %>% 
    
    filter(is.na(state)) %>% 
    
    # Add excluded_notes
    mutate(
      date = Sys.Date(),
      
      excluded_notes = "These data points are not within any of the 8 states and territories ABS borders. They are either overseas data points or are just off the coastline of Australia."
      )
  
  excluded_data <- bind_rows(excluded_data, add_excluded)
  return(excluded_data)
}
excluded_data <- rabbits_all_states_clean %>% 
  add_not_within_abs_map_to_excluded_data()

Now our final set of functions is as follows:

rabbits_all_states_clean <- rabbits_all_states %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values()

skim(rabbits_all_states_clean)
## Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined
## `sfl` provided. Falling back to `character`.
Data summary
Name rabbits_all_states_clean
Number of rows 109891
Number of columns 10
_______________________
Column type frequency:
character 7
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
scientificName 0 1 21 31 0 2 0
taxonConceptID 0 1 73 73 0 2 0
recordID 0 1 36 36 0 109891 0
dataResourceName 0 1 8 60 0 26 0
occurrenceStatus 0 1 7 7 0 1 0
geometry 0 1 11 27 0 84558 0
state 0 1 8 28 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
decimalLatitude 0 1 -33.70 4.01 -43.49 -36.61 -34.18 -32.73 -11.88 ▃▇▂▁▁
decimalLongitude 0 1 143.33 7.19 113.17 140.31 144.27 149.13 153.62 ▁▁▂▇▆

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
eventDate 0 1 1849-01-01 2023-10-01 01:12:00 2016-04-26 17352

Nice, all columns are complete (no missing values). There are 109,891 unique recordID values and the same number of observations, so we haven’t accidentally duplicated any records. Interestingly, there are 84,558 unique geometry values, so several thousand observations must share overlapping locations. There are 8 state values, which makes sense.
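The recordID uniqueness claim above can be turned into an explicit assertion, so the check fails loudly if a future species introduces duplicates:

```r
# Stop with an error if any recordID appears more than once
stopifnot(
  n_distinct(rabbits_all_states_clean$recordID) == nrow(rabbits_all_states_clean)
)
```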

Plenty of sense checks; I’m happy with the data cleaning of rabbits in Australia!

Let’s check our excluded data.

excluded_data %>% 
  group_by(excluded_notes) %>% 
  count()

Scaling to other species!

Apply all the functions to the European Red Fox

Now that we’ve cleaned the data for rabbits across all of Australia, let’s take our functions and apply them to a different species! Let’s try foxes (Vulpes vulpes).

# Raw Data
raw_foxes <- load_galah_occurrence_data(
  species_name = "vulpes vulpes"
  )
## This query will return 127,963 records
## 
## Checking queue
## Current queue size: 1 inqueue . running .......
# Processed Data
clean_foxes <- raw_foxes %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values()

# Excluded Data
excluded_data <- raw_foxes %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_foxes %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_foxes %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

Sanity checks on the foxes data

Let’s look at the raw data

skim(raw_foxes)
Data summary
Name raw_foxes
Number of rows 127963
Number of columns 8
_______________________
Column type frequency:
character 5
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
scientificName 0 1 13 20 0 2 0
taxonConceptID 0 1 73 73 0 2 0
recordID 0 1 36 36 0 127963 0
dataResourceName 0 1 8 72 0 41 0
occurrenceStatus 0 1 7 7 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
decimalLatitude 152 1 -33.96 3.29 -41.8 -36.16 -34.10 -32.29 58.79 ▇▁▁▁▁
decimalLongitude 152 1 147.36 7.00 -117.2 145.26 149.45 151.00 153.62 ▁▁▁▁▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
eventDate 664 0.99 1870-04-01 2023-09-29 18:19:00 2014-07-28 15819

Sweet! There are 127,963 fox observations in this dataset, with 152 observations missing spatial data and 664 missing dates. These are the counts we should expect in the excluded_data dataset. Let’s check!

excluded_data %>% 
  group_by(scientificName, excluded_notes) %>% 
  count()

Amazing! Our functions work! Let’s look at our cleaned data.

clean_foxes %>% skim()
## Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined
## `sfl` provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 126813
Number of columns 10
_______________________
Column type frequency:
character 7
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
scientificName 0 1 13 20 0 2 0
taxonConceptID 0 1 73 73 0 2 0
recordID 0 1 36 36 0 126813 0
dataResourceName 0 1 8 71 0 30 0
occurrenceStatus 0 1 7 7 0 1 0
geometry 0 1 11 27 0 73830 0
state 0 1 8 28 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
decimalLatitude 0 1 -33.99 2.87 -41.80 -36.11 -34.10 -32.29 -12.57 ▃▇▁▁▁
decimalLongitude 0 1 147.44 6.17 113.62 145.27 149.49 151.00 153.62 ▁▁▁▃▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
eventDate 0 1 1870-04-01 2023-09-29 18:19:00 2014-07-29 15710

Everything looks good using the skim() function.

Scaling to a lot more species!

Time to apply these functions to a few more species. Let’s find a bunch more invasive animal species in Australia, put their scientific names into a vector, and then use that vector to run our functions and output the data as a list of tibbles.

Let’s also create a new column indicating which species is which, because some species have multiple different scientific names.

These are the species I want to focus on:

  • European Rabbit (Oryctolagus cuniculus)
  • European Red Fox (Vulpes vulpes)
  • Cane Toad (Rhinella marina)
  • Feral Cat (Felis catus)
  • Feral Horse (Equus caballus)
  • Feral Pig (Sus scrofa)
  • Red Imported Fire Ant (Solenopsis invicta)

European Rabbit

This will take a few minutes to run.

# Raw Data
raw_rabbits <- load_galah_occurrence_data(
  species_name = "Oryctolagus cuniculus"
  )
## This query will return 113,383 records
## 
## Checking queue
## Current queue size: 1 inqueue . running ......
# Processed Data
clean_rabbits <- raw_rabbits %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values() %>% 
  mutate(
    simpleName = "European Rabbit"
  )

# Excluded Data
excluded_data <- raw_rabbits %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_rabbits %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_rabbits %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

European Red Fox

This will take a few minutes to run.

# Raw Data
raw_foxes <- load_galah_occurrence_data(
  species_name = "Vulpes vulpes"
  )
## This query will return 127,963 records
## 
## Checking queue
## Current queue size: 1 inqueue . running .......
# Processed Data
clean_foxes <- raw_foxes %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values() %>% 
  mutate(
    simpleName = "European Red Fox"
  )

# Excluded Data
excluded_data <- raw_foxes %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_foxes %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_foxes %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

Cane Toad

# Raw Data
raw_cane_toads <- load_galah_occurrence_data(
  species_name = "Rhinella marina"
  )
## This query will return 32,141 records
## 
## Checking queue
## Current queue size: 1 inqueue  running ...
# Processed Data
clean_cane_toads <- raw_cane_toads %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values() %>% 
  mutate(
    simpleName = "Cane Toad"
  )

# Excluded Data
excluded_data <- raw_cane_toads %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_cane_toads %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_cane_toads %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

Feral Cat

# Raw Data
raw_cats <- load_galah_occurrence_data(
  species_name = "Felis catus"
  )
## This query will return 35,757 records
## 
## Checking queue
## Current queue size: 1 inqueue . running ...
# Processed Data
clean_cats <- raw_cats %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values() %>% 
  mutate(
    simpleName = "Feral Cat"
  )

# Excluded Data
excluded_data <- raw_cats %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_cats %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_cats %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

Feral Horse

# Raw Data
raw_horses <- load_galah_occurrence_data(
  species_name = "Equus caballus"
  )
## This query will return 8,316 records
## 
## Checking queue
## Current queue size: 1 inqueue  running .
# Processed Data
clean_horses <- raw_horses %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values() %>% 
  mutate(
    simpleName = "Feral Horse"
  )

# Excluded Data
excluded_data <- raw_horses %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_horses %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_horses %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

Feral Pig

# Raw Data
raw_pigs <- load_galah_occurrence_data(
  species_name = "Sus scrofa"
  )
## This query will return 22,436 records
## 
## Checking queue
## Current queue size: 1 inqueue  running ..
# Processed Data
clean_pigs <- raw_pigs %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values() %>% 
  mutate(
    simpleName = "Feral Pig"
  )

# Excluded Data
excluded_data <- raw_pigs %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_pigs %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_pigs %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

Red Imported Fire Ant

# Raw Data
raw_red_fire_ants <- load_galah_occurrence_data(
  species_name = "Solenopsis invicta"
  )
## This query will return 125 records
## 
## Checking queue
## Current queue size: 1 inqueue  running
# Processed Data
clean_red_fire_ants <- raw_red_fire_ants %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  remove_not_within_abs_map_values() %>% 
  mutate(
    simpleName = "Red Imported Fire Ant"
  )

# Excluded Data
excluded_data <- raw_red_fire_ants %>% 
  add_na_eventdate_to_excluded_data()

excluded_data <- raw_red_fire_ants %>% 
  add_na_spatial_to_excluded_data()

excluded_data <- raw_red_fire_ants %>% 
  remove_na_values() %>% 
  add_state_column() %>% 
  add_not_within_abs_map_to_excluded_data()

Bulk sanity checks on the clean data

Let’s put all of the cleaned datasets into one big tibble.

clean_invasive_species_data <- clean_rabbits %>% 
  bind_rows(
    clean_foxes,
    clean_cane_toads,
    clean_cats,
    clean_horses,
    clean_pigs,
    clean_red_fire_ants
  )

clean_invasive_species_data

Quick skim check

Great! Looking good! Let’s do a quick skim().

skim(clean_invasive_species_data)
## Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined
## `sfl` provided. Falling back to `character`.
Data summary
Name clean_invasive_species_da…
Number of rows 327522
Number of columns 11
_______________________
Column type frequency:
character 8
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
scientificName 0 1 10 31 0 10 0
taxonConceptID 0 1 12 73 0 10 0
recordID 0 1 36 36 0 327522 0
dataResourceName 0 1 6 71 0 40 0
occurrenceStatus 0 1 7 7 0 1 0
geometry 0 1 11 27 0 195990 0
state 0 1 8 28 0 8 0
simpleName 0 1 9 21 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
decimalLatitude 0 1 -31.71 6.54 -43.49 -35.81 -33.80 -29.96 -10.14 ▂▇▂▁▁
decimalLongitude 0 1 145.01 7.50 112.94 142.04 147.43 150.71 159.08 ▁▁▃▇▆

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
eventDate 0 1 1849-01-01 2023-10-01 01:12:00 2014-11-17 29235

Excellent! There are 327,522 rows and the same number of unique recordID values. There are 8 state values, which makes sense given Australia’s six states and two territories. There are 7 simpleName values, which also makes sense, as we only pulled 7 different invasive species.
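These eyeball checks can also be written as assertions, so a future re-run fails loudly if anything drifts. A minimal sketch, assuming the column names shown in the skim output above; check_clean_data() is a hypothetical helper, not part of the pipeline:

```r
# Hypothetical assertion helper for the combined tibble.
check_clean_data <- function(df) {
  stopifnot(
    # Every record should carry a unique recordID
    !anyDuplicated(df$recordID),
    # Seven species were pulled, so seven simpleName values
    length(unique(df$simpleName)) == 7,
    # Six states plus two territories
    length(unique(df$state)) == 8
  )
  invisible(df)
}

# check_clean_data(clean_invasive_species_data)
```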

Quick visualisation check

Let’s visually display all the data.

state2021 %>% 
  filter(!state_code_2021 %in% c("9", "Z")) %>% 
  ggplot() +
  
  # Ref [1]: Create background polygons of the states/territories
  geom_sf(
    aes(geometry = geometry)
  ) +
  
  geom_point(
    data = clean_invasive_species_data,
    aes(
      x = decimalLongitude,
      y = decimalLatitude,
      colour = simpleName
    ),
    alpha = 0.2
  ) +
  
  coord_sf() +
  
  theme_minimal() +
  
  facet_wrap(vars(simpleName)) +
  
  labs(title = "Seven Invasive Animal Species in Australia")

Fantastic! It seems there are a few observations to the east of the NSW coastline, which likely sit on an offshore island. The distributions across Australia differ in interesting ways between the species.
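To identify those offshore records, one could filter on longitude. A small sketch; points_east_of() is a hypothetical helper, and the 154°E default cutoff is an arbitrary value just east of the NSW coastline, not part of the cleaning rules:

```r
# Hypothetical helper: keep rows east of a given longitude cutoff.
points_east_of <- function(df, cutoff = 154) {
  df[df$decimalLongitude > cutoff, , drop = FALSE]
}

# Inspect the offshore records by species and state, e.g.:
# points_east_of(clean_invasive_species_data) %>% count(simpleName, state)
```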

I think the data cleaning is now complete! We’ve excluded the data points that don’t fall within the state/territory borders of Australia, as well as points with NA values for longitude and latitude or with no observation date.